This analysis focuses on the behaviors of casual riders and subscription members who rode Cyclistic bikes in Chicago through their bike sharing service. The company would like to convert more casual riders to subscriptions, and tasked me with finding the ways in which the two groups behave differently. The data has a single observation for each ride. The key information that is missing from this data is a way to identify the riders themselves: I will not be able to connect the individual rides to the person riding the bike.
I am using one year of data from divvy-tripdata for the time period from May 2024 through April 2025. This was the most recent data available when I first loaded the data into my table on June 2nd.
It was made available under this license: Data License Agreement | Divvy Bikes
My goal for this project is to establish ways in which behavior differed between casual riders and subscription members, and to identify patterns that could potentially be relevant to the business goal of increasing the number of subscription members as a share of all riders.
I began this project by previewing the first and last months of the data in RStudio in order to determine the schema, and whether the columns were the same for the entire year. I then used PostgreSQL (by way of the query tool in pgAdmin GUI) to clean, transform, and analyze the data. First, I created a table for the entire year. I specified all the datatypes for the table, and included a check to ensure the locations specified by the starting and ending latitude and longitude columns were valid. Once the table was specified, I loaded the data into it using SQL. The table is composed of 12 CSV files organized consecutively.
I then checked the data for cleanliness and validity. I used COUNT DISTINCT and GROUP BY statements to determine the number of unique values for the character fields, and searched for null values in every column. The data is very clean, and every observation has a value in the primary key column.
I used a few more basic queries to check for the total number of observations (5,735,884) and to generally explore and get a feel for the data.
The data table contains the following dimensions: ride type, start station, end station, and whether the rider is a subscription member; plus the start time and end time. There was no quantitative data immediately available in the data, and the raw timestamps were not useful in their current form. It was necessary to create some calculated columns.
I started with the quantitative measurements, because these can easily be used in every query without complicated grouping logic. I came up with three:
Number of rides, derived using the COUNT function.
Trip duration, derived by subtracting the start timestamp from the end timestamp and converting the interval to an EPOCH numeric datatype.
Absolute distance traveled, derived through a simple trigonometric formula using the starting and ending longitude and latitude. This formula operated on the assumption of a flat plane of coordinates, which I believe is appropriate for a set of locations concentrated in a single city.
I then ran a series of queries grouping the data with SQL by both membership cohort and each of the qualitative columns, with the three measurements included. This yielded limited insight, so I began transforming the qualitative columns as well. I created the following calculated qualitative columns for my analysis:
Month, extracted from the ride start timestamp
Day of week, extracted from the ride start timestamp
Hour, extracted from the ride start timestamp
Trip direction, derived using a CASE statement with a trigonometric function from the starting and ending longitude and latitude
Neighborhood. This was created by spatial-joining my table to a GIS geometry table that I found on the City of Chicago website (link: Boundaries - Neighborhoods | City of Chicago | Data Portal). I chose neighborhoods rather than zip codes or wards because they are more recognizable.
I created summary tables with SQL that featured each of these, and made simple visualizations for each to help with my understanding. For these, I also isolated rides where the rider did not actually go anywhere (likely never used the bike at all), and then updated my queries to exclude these zero-distance rides when I failed to find a useful pattern for when they happened.
I also attempted to load the entire table into Tableau to allow me to more easily manipulate the data on the fly. However, the program would not run smoothly with a data table so large, especially once calculated fields were added. Nearly all data transformation was done through SQL.
With my relevant columns having been created, I made an overview table in Google Sheets to summarize all of the relationships I was able to measure. I highlighted significant relationships in red, and small but potentially useful relationships in a light pink.
(this table is also included in my repository as it may be too difficult to read in this format) This gave me a somewhat comprehensive look at what is in the dataset. In terms of completeness, the logical next step would be to analyze combinations of multiple qualitative columns, but that could require hundreds of separate queries and charts, and likely a lot of wasted time.
This first analysis yielded some concrete insights about rider behavior, which I outline below, supported with visualizations from Tableau.
Approximately 65% of trips were taken by subscription members. Casual riders took longer trips (in terms of ride duration) than subscription members. Both groups travelled similar distances.
The other relationships between variables were less relevant to this analysis, so I focused on combining the most useful qualitative data.
I grouped the data by both day of week and time of day. First I determined that most of the difference in the numbers of casual rides and subscription members happened during the work week.
Then I established a distinct difference between rider behavior during the work week (Monday through Friday) and the weekend. There was a clear pattern to rider behavior between these days, and though the pattern was more pronounced for subscription members, it was still clearly observable for casual riders.
Weekend rides gradually increased from the early morning to a flat peak from 11am to 5pm, while work week rides showed the hourly pattern where rides had a sharp peak from 7-9am, and then a larger sharp peak from 3-7pm.
I also bucketed the 97 Chicago neighborhoods into three categories: Central, Inner Ring, and Outer Ring. These categories showed noticeable differences in rider behavior.
On the map, this is how they are oriented:
Rides are broken down between these neighborhood categories as shown above.
Below is how they were distributed:
After reviewing this dataset in a comprehensive manner, I propose the following business decisions. These may rely on other factors, but they are rooted in insights from this data.